"Securing the Web, One URL at a Time"
Developed as an AI/ML Showcase Project
MOHAMMAD AFROZ ALI
Data Science | Machine Learning Enthusiast
Aspiring SDE, AIML Intern
Final Semester – B.Tech (Information Technology) - 8.0/10 CGPA
Muffakham Jah College of Engineering & Technology
AI/ML • Software Engineering • Cloud Technologies
Keen on Artificial Intelligence & Machine Learning
Focus on building end-to-end solutions that combine ML with software engineering best practices
Phishing attacks represent a significant cybersecurity threat, targeting users through deceptive websites that mimic legitimate services to steal sensitive information. This project delivers an end-to-end machine learning solution to accurately identify phishing URLs using their structural and behavioral characteristics.
With the increasing sophistication of phishing attacks, traditional rule-based detection methods often fall short. Machine learning approaches can adapt to evolving threats by identifying subtle patterns in URL characteristics, enabling more robust protection for users navigating the web.
The project utilizes a comprehensive dataset containing 30 features extracted from URLs, each designed to capture a different potential phishing indicator, spanning URL structure, security signals, and page behavior.
The target variable Result identifies URLs as either legitimate (0) or phishing (1).
Each component (ingestion, validation, transformation, training, prediction) is implemented as a separate module with clear interfaces, enabling maintainability and extensibility.
Complete automation from data ingestion to model deployment, with robust logging and exception handling at each stage to ensure reliability.
Integration with MLflow for tracking experiments, storing metrics, and managing model artifacts with version control.
Seamless deployment to AWS infrastructure with storage in S3, containerization via Docker, and CI/CD through GitHub Actions.
Strict schema enforcement using YAML configuration to ensure data quality and consistency across the pipeline.
DAGsHub integration for model versioning and experiment tracking with Git-like capabilities for ML artifacts.
The data ingestion pipeline extracts phishing URL data from a MongoDB database, which serves as the primary data source. The implementation leverages PyMongo to connect to the database and retrieve the dataset using the configured database and collection names.
# MongoDB Connection and Data Export
import pymongo
import pandas as pd
import numpy as np

def export_collection_as_dataframe(self):
    database_name = self.data_ingestion_config.database_name
    collection_name = self.data_ingestion_config.collection_name
    # MONGO_DB_URL is supplied via environment configuration
    self.mongo_client = pymongo.MongoClient(MONGO_DB_URL)
    collection = self.mongo_client[database_name][collection_name]
    # Pull every document in the collection into a DataFrame
    df = pd.DataFrame(list(collection.find()))
    # Drop MongoDB's internal identifier column
    if "_id" in df.columns.to_list():
        df = df.drop(columns=["_id"])
    # Normalize the "na" string sentinel to a proper NaN
    df.replace({"na": np.nan}, inplace=True)
    return df
After extraction, the data is split into training and test sets using scikit-learn's train_test_split function with a configurable split ratio. This ensures that model evaluation is performed on unseen data.
The raw dataset is persisted to a feature store file path for future reference and to maintain a historical record of the data used in each training run. This supports reproducibility and model versioning.
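A minimal sketch of how this split-and-persist step might look inside the ingestion component; the config attribute names (train_test_split_ratio, feature_store_file_path, and the train/test file paths) are illustrative assumptions, not the project's exact API:

import os
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data_as_train_test(self, dataframe: pd.DataFrame) -> None:
    # Split with the configurable ratio defined in the ingestion config
    train_set, test_set = train_test_split(
        dataframe, test_size=self.data_ingestion_config.train_test_split_ratio
    )
    # Persist the raw dataset to the feature store for reproducibility
    os.makedirs(os.path.dirname(self.data_ingestion_config.feature_store_file_path), exist_ok=True)
    dataframe.to_csv(self.data_ingestion_config.feature_store_file_path, index=False)
    # Persist the split sets consumed by downstream components
    train_set.to_csv(self.data_ingestion_config.training_file_path, index=False)
    test_set.to_csv(self.data_ingestion_config.testing_file_path, index=False)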
Data validation is a critical step to ensure the quality and integrity of the input data. The process involves validating the dataset against a predefined schema (schema.yaml) that specifies the expected column names and data types.
# Column Validation Function
def validate_number_of_columns(self, dataframe: pd.DataFrame) -> bool:
    try:
        number_of_columns = len(self._schema_config)
        logging.info(f"Required number of columns: {number_of_columns}")
        logging.info(f"Data frame has columns: {len(dataframe.columns)}")
        if len(dataframe.columns) == number_of_columns:
            return True
        return False
    except Exception as e:
        raise NetworkSecurityException(e, sys)
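For context, the self._schema_config referenced above would typically be loaded from schema.yaml when the validation component is initialized. A minimal sketch, assuming a read_yaml_file helper and a "columns" mapping inside the file (both names are assumptions for illustration):

# Sketch: loading the schema consumed by the validator.
# The helper name and the schema.yaml layout are assumptions.
import yaml

def read_yaml_file(file_path: str) -> dict:
    with open(file_path, "rb") as yaml_file:
        return yaml.safe_load(yaml_file)

# e.g. in the validation component's __init__:
# self._schema_config = read_yaml_file("data_schema/schema.yaml")["columns"]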
The validation process also includes checking for data drift between the training and test datasets using the Kolmogorov-Smirnov statistical test. This helps identify shifts in data distribution that could impact model performance.
For each feature, the p-value from the KS test is compared against a predefined threshold (0.05). Values below this threshold indicate significant drift that may require attention.
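A sketch of how such a per-feature check can be implemented with SciPy's two-sample KS test; the function shape and report structure are illustrative rather than the project's exact implementation:

# Illustrative per-feature drift check using the two-sample KS test
from scipy.stats import ks_2samp
import pandas as pd

def detect_dataset_drift(base_df: pd.DataFrame, current_df: pd.DataFrame,
                         threshold: float = 0.05) -> dict:
    report = {}
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        # A p-value below the threshold flags a significant distribution shift
        drift_found = result.pvalue < threshold
        report[column] = {"p_value": float(result.pvalue), "drift_status": drift_found}
    return report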
Data transformation is a crucial step that prepares the raw input features for model training. The process focuses on handling missing values and ensuring consistent feature representation across training and inference.
The transformation pipeline uses scikit-learn's KNNImputer to handle missing values in the dataset. This approach replaces missing values with the values from the K nearest neighbors in the feature space, which preserves the relationships between features better than simpler methods like mean or median imputation.
# KNN Imputer Configuration
DATA_TRANSFORMATION_IMPUTER_PARAMS: dict = {
    "missing_values": np.nan,
    "n_neighbors": 3,
    "weights": "uniform",
}
The imputer is configured with n_neighbors=3 and uniform weights, meaning it uses the 3 closest neighbors with equal weighting to determine the replacement value for each missing data point.
The transformation process follows these steps:
1. Read the validated training and test sets produced by the previous stage.
2. Separate the input features from the Result target column.
3. Fit the KNN imputer on the training features and apply it to both splits.
4. Persist the fitted preprocessor and the transformed arrays as artifacts for model training.
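A condensed sketch of this flow, assuming the helper names and TARGET_COLUMN constant used here purely for illustration:

# Condensed sketch of the transformation flow; names are illustrative
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

TARGET_COLUMN = "Result"  # assumed from the dataset description above

def transform(train_df: pd.DataFrame, test_df: pd.DataFrame):
    # Separate input features from the target
    X_train = train_df.drop(columns=[TARGET_COLUMN])
    y_train = train_df[TARGET_COLUMN]
    X_test = test_df.drop(columns=[TARGET_COLUMN])
    y_test = test_df[TARGET_COLUMN]

    # Fit the imputer on training data only, then apply to both splits
    preprocessor = Pipeline([("imputer", KNNImputer(missing_values=np.nan,
                                                    n_neighbors=3, weights="uniform"))])
    X_train_t = preprocessor.fit_transform(X_train)
    X_test_t = preprocessor.transform(X_test)
    return preprocessor, (X_train_t, y_train), (X_test_t, y_test)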
The 30 features in the dataset capture various URL and webpage characteristics that are indicative of phishing attempts. These include structural elements (presence of IP addresses, URL length), security indicators (SSL certificates, domain registration details), and behavioral aspects (redirect patterns, form submission targets).
By maintaining these features' integrity through proper transformation, the model can effectively learn the patterns that distinguish legitimate websites from phishing attempts.
The training pipeline evaluates multiple classification algorithms to identify the most effective approach for phishing detection. Each algorithm is rigorously tested with various hyperparameter configurations through GridSearchCV to determine the optimal settings.
Random Forest: An ensemble of decision trees that performs well on structured data with mixed feature types.
Tuned parameters: n_estimators
Decision Tree: A single decision tree offering good interpretability and feature importance rankings.
Tuned parameters: criterion
Gradient Boosting: A sequential ensemble method that builds trees to correct the errors of previous trees.
Tuned parameters: learning_rate, subsample, n_estimators
Logistic Regression: A linear model for binary classification that provides probability estimates.
Default parameters
AdaBoost: A boosting algorithm that weights misclassified samples more heavily in subsequent iterations.
Tuned parameters: learning_rate, n_estimators
Each model undergoes extensive hyperparameter tuning to optimize performance. The implementation uses scikit-learn's GridSearchCV to perform exhaustive search over specified parameter values for each classifier.
# Parameter Grid Example
params = {
    "Random Forest": {
        'n_estimators': [8, 16, 32, 128, 256]
    },
    "Gradient Boosting": {
        'learning_rate': [.1, .01, .05, .001],
        'subsample': [0.6, 0.7, 0.75, 0.85, 0.9],
        'n_estimators': [8, 16, 32, 64, 128, 256]
    },
    "AdaBoost": {
        'learning_rate': [.1, .01, .001],
        'n_estimators': [8, 16, 32, 64, 128, 256]
    }
}
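A simplified sketch of the selection loop that consumes this grid, assuming a models dict keyed by the same classifier names; the function shape is illustrative rather than the project's exact implementation:

# Simplified sketch of exhaustive tuning and model selection
from sklearn.model_selection import GridSearchCV

def evaluate_models(X_train, y_train, X_test, y_test, models: dict, params: dict) -> dict:
    report = {}
    for name, model in models.items():
        # Exhaustive search over the parameter grid with cross-validation
        gs = GridSearchCV(model, params.get(name, {}), cv=3)
        gs.fit(X_train, y_train)
        # Refit the classifier with the best-found parameters
        model.set_params(**gs.best_params_)
        model.fit(X_train, y_train)
        # Score on held-out data to drive final model selection
        report[name] = model.score(X_test, y_test)
    return report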
After evaluating all models with their optimal hyperparameters, the best performing model is selected based on test set performance. This model is then wrapped with the preprocessor to create a unified prediction pipeline that handles both preprocessing and inference.
A custom NetworkModel class encapsulates both the preprocessor and the trained model to ensure consistent feature transformation during inference:
class NetworkModel:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    def predict(self, x):
        # Apply the same transformation used during training, then infer
        x_transform = self.preprocessor.transform(x)
        return self.model.predict(x_transform)
The project uses comprehensive classification metrics to evaluate model performance, with a focus on balanced assessment that considers both false positives and false negatives:
F1 Score: Harmonic mean of precision and recall, providing a single balanced measure of classification performance.
Precision: Ratio of correctly predicted positives to all predicted positives; high precision means few false positives (legitimate URLs wrongly flagged).
Recall: Ratio of correctly predicted positives to all actual positives; high recall means few false negatives (phishing URLs that slip through).
The evaluation framework assesses each model on both training and test datasets to monitor for overfitting and ensure generalization capability. The metrics are logged in MLflow for experiment tracking and comparison.
# Classification Metric Function
from sklearn.metrics import f1_score, precision_score, recall_score

def get_classification_score(y_true, y_pred):
    f1 = f1_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    return ClassificationMetricArtifact(
        f1_score=f1,
        precision_score=precision,
        recall_score=recall
    )
Beyond typical model evaluation, the pipeline also analyzes data drift between training and testing datasets. This helps identify potential distribution shifts that could impact model performance in production.
The Kolmogorov-Smirnov test is used to compare feature distributions, with a threshold of 0.05 for the p-value to flag significant drift.
MLflow is integrated into the training pipeline to track experiments, metrics, parameters, and model artifacts. This provides versioning capabilities and reproducibility for the machine learning workflow.
# MLflow Tracking with DAGsHub Integration
import mlflow
import dagshub

dagshub.init(repo_owner='MOHD-AFROZ-ALI', repo_name='ml-phish-detector', mlflow=True)

def track_mlflow(self, best_model, classificationmetric):
    with mlflow.start_run():
        f1_score = classificationmetric.f1_score
        precision_score = classificationmetric.precision_score
        recall_score = classificationmetric.recall_score
        mlflow.log_metric("f1_score", f1_score)
        mlflow.log_metric("precision", precision_score)
        mlflow.log_metric("recall_score", recall_score)
        mlflow.sklearn.log_model(best_model, "model")
Each training run is tracked as a separate run in MLflow, capturing the evaluation metrics (F1 score, precision, and recall), the hyperparameters of the selected model, and the serialized model artifact.
The project leverages DAGsHub for remote storage of ML experiments, providing a GitHub-like interface for ML artifacts. This enables team collaboration and experiment comparison in a centralized repository.
The application is containerized using Docker to ensure consistent deployment across environments. The Dockerfile packages the entire application, including dependencies and trained models.
FROM python:3.10-slim-buster
WORKDIR /app
COPY . /app
RUN apt update -y && apt install awscli -y
RUN apt-get update && pip install -r requirements.txt
CMD ["python3", "app.py"]
The containerized application can be deployed to any environment that supports Docker, ensuring consistency between development, testing, and production environments.
A CI/CD pipeline is implemented using GitHub Actions to automate the testing, building, and deployment process. The workflow is triggered on each push to the main branch and performs three stages: continuous integration (checking out the code and running validation checks), continuous delivery (building the Docker image and pushing it to Amazon ECR), and continuous deployment (pulling the latest image onto the EC2 host and serving the application).
The project utilizes AWS services for cloud deployment, with the following components:
Amazon S3: Used for storing data artifacts, model versioning, and dataset backups. The S3 syncer module facilitates seamless file transfer between local and cloud storage (see the sketch after this list).
Amazon ECR: Hosts Docker images built by the CI/CD pipeline, providing version control and secure storage for container images.
Amazon EC2: Hosts the deployed application, providing scalable compute resources for the phishing detection service.
AWS IAM: Manages access permissions for the different AWS services, ensuring secure operation with least-privilege principles.
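The S3 syncer mentioned above can be a thin wrapper over the AWS CLI; a minimal sketch, where the class and method names are assumptions for illustration:

# Minimal sketch of an S3 sync helper built on the AWS CLI
import os

class S3Sync:
    def sync_folder_to_s3(self, folder: str, aws_bucket_url: str) -> None:
        # Push local artifacts (data, models) to the bucket
        os.system(f"aws s3 sync {folder} {aws_bucket_url}")

    def sync_folder_from_s3(self, folder: str, aws_bucket_url: str) -> None:
        # Pull previously stored artifacts back to local storage
        os.system(f"aws s3 sync {aws_bucket_url} {folder}")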
The deployed application exposes a FastAPI endpoint that accepts a CSV of extracted URL features and returns phishing predictions. This RESTful API allows integration with other systems and services.
@app.post("/predict")
async def predict_route(request: Request, file: UploadFile = File(...)):
    try:
        # Load the uploaded CSV of URL features
        df = pd.read_csv(file.file)
        # Restore the fitted preprocessor and trained model
        preprocessor = load_object("final_model/preprocessor.pkl")
        final_model = load_object("final_model/model.pkl")
        network_model = NetworkModel(preprocessor=preprocessor, model=final_model)
        y_pred = network_model.predict(df)
        df['predicted_column'] = y_pred
        df.to_csv('prediction_output/output.csv')
        # Render the predictions as an HTML table
        table_html = df.to_html(classes='table table-striped')
        return templates.TemplateResponse("table.html", {"request": request, "table": table_html})
    except Exception as e:
        raise NetworkSecurityException(e, sys)
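Assuming the service runs locally on port 8000 (host, port, and file name are assumptions), a client can exercise the endpoint by posting a CSV of extracted URL features:

# Hypothetical client call against the /predict route
import requests

with open("url_features.csv", "rb") as f:
    response = requests.post(
        "http://localhost:8000/predict",
        files={"file": ("url_features.csv", f, "text/csv")},
    )
print(response.status_code)  # the route returns an HTML table of predictions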
The ML Phish Detector project demonstrates a comprehensive approach to building an end-to-end machine learning solution for cybersecurity. By combining robust data pipelines, advanced model training techniques, and automated deployment workflows, the system provides reliable phishing detection capabilities that can adapt to evolving threats.
While the current implementation provides robust phishing detection capabilities, several enhancements could further improve the system in future iterations.
The ML Phish Detector project stands as a practical demonstration of MLOps principles applied to cybersecurity, showcasing how automated ML pipelines can be leveraged to address real-world security challenges at scale.
github.com/MOHD-AFROZ-ALI/ml-phish-detector
+91 9959786710
linkedin.com/in/mohd-afroz-ali
© 2025 Mohammad Afroz Ali. All rights reserved.
Built with Python, scikit-learn, MLflow, Docker, and AWS